March 31, 2019

Task overview

BUSINESS QUESTION: Which 5 products are going to be the most profitable for the company?

What data do we have?

New product attributes and existing product attributes.

  • Predicting sales of four different product types: PCs, laptops, netbooks and smartphones
  • Assessing the impact that service reviews and customer reviews have on the sales of different product types

Index

  1. Data cleaning

  2. Data exploration

  3. Pre-process: feature selection (correlation matrix) & feature engineering

  4. Modeling: linear regression, KNN, SVM, random forest, GBM

  5. Error analysis

Data cleaning

Transformation to factor:

fact_var <- c("ProductType","ProductNum")
# lapply() keeps each column as a factor; apply() would return a character matrix
ex_prod[fact_var] <- lapply(ex_prod[fact_var], as.factor)

Giving names to the rows:

# column_to_rownames() moves ProductNum into the row names and drops the column
ex_prod <- tibble::column_to_rownames(.data = ex_prod,
                                      var = "ProductNum")

Data cleaning: missing values with VIM

## 
##  Variables sorted by number of missings: 
##               Variable  Count
##        BestSellersRank 0.1875
##            ProductType 0.0000
##                  Price 0.0000
##          x5StarReviews 0.0000
##          x4StarReviews 0.0000
##          x3StarReviews 0.0000
##          x2StarReviews 0.0000
##          x1StarReviews 0.0000
##  PositiveServiceReview 0.0000
##  NegativeServiceReview 0.0000
##       Recommendproduct 0.0000
##         ShippingWeight 0.0000
##           ProductDepth 0.0000
##           ProductWidth 0.0000
##          ProductHeight 0.0000
##           ProfitMargin 0.0000
##                 Volume 0.0000
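The proportions of missing values in a table like this one can also be checked without VIM. The sketch below uses base R on a small made-up data frame (the `toy` object and its values are hypothetical, not the project data). It also shows one common option, dropping any column whose share of missing values exceeds a threshold. The 0.15 threshold is an assumption for illustration; the slides do not say how BestSellersRank was handled.

```r
# Hypothetical toy data, not the Blackwell data set
toy <- data.frame(
  BestSellersRank = c(12, NA, 40, NA, 7, 3, 25, 18),
  Price           = c(949, 199, 329, 1199, 99, 129, 379, 499),
  Volume          = c(12, 8, 20, 300, 1200, 44, 680, 16)
)

# Share of missing values per column, largest first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)

# Assumed rule: keep only columns with at most 15% missing values
na_threshold <- 0.15
toy_clean <- toy[, colMeans(is.na(toy)) <= na_threshold, drop = FALSE]
print(names(toy_clean))
```

Dropping the column is only one possibility; imputing the missing ranks would be an alternative if the variable turned out to matter.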

1st data exploration: Blackwell business

1st data exploration: Volume distribution

1st modeling round: linear regression

# train and test split (75% / 25%)
library(caret)
train_id <- createDataPartition(y = ex_prod$Volume, p = 0.75, list = FALSE)
train <- ex_prod[train_id,]
test <- ex_prod[-train_id,]

# create linear regression model
mod_lm <- lm(formula = Volume ~ ., data = train)

# model performance
postResample(pred = predict(object = mod_lm, newdata = test),
             obs = test$Volume)
##         RMSE     Rsquared          MAE 
## 1.394552e-12 1.000000e+00 6.345334e-13
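For reference, the three numbers in this output can be computed by hand. The sketch below is a base R illustration on made-up vectors (`obs`, `pred` and the helper `regression_metrics` are hypothetical names). It assumes the usual definitions, with R-squared taken as the squared correlation between predictions and observations, which is the form caret reports for regression as far as its documentation describes.

```r
# Hypothetical observed and predicted values
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 37)

# RMSE, squared correlation and MAE for a set of predictions
regression_metrics <- function(pred, obs) {
  err <- pred - obs
  c(RMSE     = sqrt(mean(err^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(err)))
}

metrics <- regression_metrics(pred, obs)
print(metrics)
```

Read this way, an RMSE around 1e-12 is zero up to floating-point precision: the test predictions match the observed volumes almost exactly.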

Main predictors:

  1. x5StarReviews
  2. ProductTypeGameConsole

2nd pre-process: feature selection
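A minimal sketch of feature selection with a correlation matrix, in base R. Everything here is illustrative: the `toy` data frame is made up, the relation between Volume and x5StarReviews in it is invented for the example, and the 0.95 cutoff is an assumption. The slides do not state which variables were removed or which threshold was used.

```r
# Hypothetical toy data, not the Blackwell data set
x5 <- c(10, 25, 3, 40, 18, 7, 60, 33)
toy <- data.frame(
  x5StarReviews = x5,
  x4StarReviews = x5 / 2 + c(1, -1, 0, 2, -2, 1, 0, -1),
  Price         = c(500, 120, 900, 300, 80, 650, 210, 1000),
  Volume        = 4 * x5   # made-up relation, for illustration only
)

cor_mat <- cor(toy)
cutoff  <- 0.95   # assumed threshold

# Each pair of variables once (upper triangle) with |correlation| above the cutoff
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  cor  = cor_mat[high]
)
print(high_pairs)
```

Pairs of predictors in this list are candidates for dropping one of the two. A predictor that is almost perfectly correlated with the target deserves a closer look before modeling.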

2nd modeling round: linear regression

##         RMSE     Rsquared          MAE 
## 1.128152e-12 1.000000e+00 6.432282e-13

Main predictors:

  1. 5 stars
  2. Product type: PC
  3. Price

The model is overfitted again: an R-squared of 1 with an RMSE near zero on the test set is too good to be realistic.
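One possible explanation for a perfect score on held-out data is that a predictor determines the target almost exactly. The sketch below illustrates that situation on made-up data. The `toy` data frame and the relation `Volume = 4 * x5StarReviews` are hypothetical, chosen only to reproduce the symptom, and are not a claim about the real data.

```r
# Hypothetical toy data in which Volume is an exact multiple of x5StarReviews
x5 <- c(10, 25, 3, 40, 18, 7, 60, 33, 15, 48, 22, 5)
toy <- data.frame(
  x5StarReviews = x5,
  Price         = c(500, 120, 900, 300, 80, 650, 210, 1000, 400, 150, 700, 260),
  Volume        = 4 * x5
)
train <- toy[1:9, ]
test  <- toy[10:12, ]

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# With the leaking predictor: numerically perfect on the test set
mod_full  <- lm(Volume ~ x5StarReviews + Price, data = train)
rmse_full <- rmse(predict(mod_full, newdata = test), test$Volume)

# Without it: the error becomes visible
mod_reduced  <- lm(Volume ~ Price, data = train)
rmse_reduced <- rmse(predict(mod_reduced, newdata = test), test$Volume)

print(c(full = rmse_full, reduced = rmse_reduced))
```

If the real data behave this way, removing or transforming the offending predictor is what gives an honest error estimate.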

3rd data exploration: stars analysis